Today we will go through a series of steps that are familiar to many scientist doing empirical work. We will pretend to have collected some data1, clean them, visualize them, and analyze them using common statistical models. We will do all these things with the tidyverse, a collection of R packages for data manipulation and plotting that aims at being easily readable not only for machines, but also for humans2. These operations will be run in an environment that ensures computational reproducibility: if you have the initial dataset, you will be able to reproduce the final results with a mouse click.
On your computer, you should already have installed R and RStudio (if not, follow the instructions for your operating system here and here).
One possible way to work on a project is to open a new .R file and write your code there. However, there are a few disadvantages:
setwd(), e.g., setwd("NAME/OF/YOUR/DIRECTORY")cat("\014") to clear the consolerm(list = ls()) to clear the environment (but it does not unload active packages!)You could do all those things, but… should you?
Jenny Bryan explains the reasons (without threats to commit arson) in this blog post. The main idea is to package code in a self-contained environment that everyone (including your future self) can run without changing their own environment.
Following Jenny Brian’s suggestion (and to avoid her wrath) we can create a project directly from RStudio:
If you check the directory you just created, you will see a file called YOURNAME_MPI2020_intro_tidyverse.Rproj. If your RStudio session is still open, you will see that you already are in that directory.
MAGIC
It is time to install the first package of this session: here. For a brief overview of what the package can do, see this GitHub repository (authored by Jenny Bryan!).
How to install packages:
It’s good to know where your packages are. Type .libPaths().
## [1] "/home/aschetti/R/x86_64-pc-linux-gnu-library/3.6"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"
The data are in .dat format, so we are going to load them using read.table.
att <-
read.table(
here::here("raw-data", "MixedAttitude.dat"), # file name
header = TRUE # data has headers
)If you don’t know what a function does, access the help (e.g., ?read.table).
Let’s have a look at the data. If you have RStudio, you can click on the name of the dataset in the Environment window (by default on the upper right corner).
You can also type the name of the dataset in the console:
These are the data of 21 participants who saw neutral, positive, or negative advertisement of beer, wine, and water in 3 separate sessions. They were asked to rate the drinks on a scale ranging from – 100 (dislike very much) through 0 (neutral) to 100 (like very much). Researchers wanted to reduce binge drinking in teenagers, so they hoped that pairing negative imagery with alcoholic beverages would lower these ratings3.
Looking at the data, you can see that the ratings of participant #99 are a bit strange… perhaps there was a technical problem, i.e., the rating scale was out of the -100/100 range.
Let’s filter this participant out. More specifically, we keep all participants that are not participant #99.
Verify that it worked.
The symbol %>% is the pipe operator , which allows to serially concatenate functions applied to the same data frame. Read it as “and then”. In the example below, we called the data frame att and then applied the filter function to discard participant #99.
For the exercises in this course, we will only need a subset of variables in this dataset.
Researchers were interested in reducing binge drinking, so we will focus our attention on negative imagery and one alcoholic beverage. We choose beer, because we are in Belgium. We also need control conditions, i.e., neutral imagery and water.
In the next code, we will simultaneously select and rename the variables we want to keep.
att.filter.select <-
att.filter %>%
select(
participant = ssj, # new name = old name
gender,
beer_negative = beerneg,
beer_neutral = beerneut,
water_negative = waterneg,
water_neutral = waterneu
)
head(att.filter.select, n = 5) # show only first 5 rowsLet’s look at the variable gender:
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
A string of 1s and 2s. It’s easy to get a glimpse of this variable because there are not many observations. When you have more observations, better use unique:
## [1] 1 2
What type of variable is it?
## [1] "integer"
Class integer has only integer values (i.e., no decimals or complex numbers).
There are many variable types in R. An interesting one is factor, i.e., variables which can have only a limited number of different values… they are basically categorical variables. In our case, it would make sense to consider gender as a factor with two levels, male and female. Let’s transform it:
# Field et al. do not specify the coding,
# so we will assume that 1 is female and 2 is male
att.filter.select.recode <-
att.filter.select %>%
mutate(sex = recode( # create a new variable 'sex'
factor(gender), # convert variable 'gender' as factor
"1" = "female", "2" = "male"
)) # assign labels to factor levels
head(att.filter.select.recode, n = 5)Now we have two variables with identical information… let’s get rid of gender:
att.filter.select.recode <- # over-write the previous data frame
att.filter.select.recode %>%
select(-gender) # discard unused variables with '-'
head(att.filter.select.recode, n = 5)Data frames can be in wide or long format:
Our data are now in wide format. However, the packages we are going to use in this course need data in long format. Let’s convert from wide to long:
att.filter.select.recode.long <-
att.filter.select.recode %>%
gather(
key = "condition", # name of new variable with all condition levels
value = "ratings", # name of new variable with all values
beer_negative:water_neutral
) # variables to be collapsed
head(att.filter.select.recode.long, n = 5)The column ratings contains the values of our dependent variable, whereas condition contains all our conditions.
This experiment has two independent variables, drink (beer or water) and imagery (neutral or negative). In our analysis, we wish to know the separate contribution of these two independent variables and their interaction.
So, we need to separate condition into 2 variables:
att.filter.select.recode.long.sep <-
att.filter.select.recode.long %>%
separate(condition, c("drink", "imagery")) %>% # the separation occurs here
# over-write participant, drink, and imagery
# and transform them to factors
mutate(
participant = factor(participant),
drink = factor(drink),
imagery = factor(imagery)
)
head(att.filter.select.recode.long.sep, n = 5)The original data are saved in a .dat file. This format is not what the cool kids use. Let’s save our processed data as .csv.
write_csv(
att.filter.select.recode.long.sep,
paste0(getwd(), "/data_attitude.csv")
) # full path & file nameIf you want to read from .csv, use read_csv.
Thanks to the versatility of the tidyverse (especially the pipe operator), all the above operations can be performed in one go:
att.filter.select.recode.long.sep <-
read.table(
here::here("raw-data", "MixedAttitude.dat"),
header = TRUE
) %>%
filter(ssj != 99) %>%
select(
participant = ssj,
gender,
beer_negative = beerneg,
beer_neutral = beerneut,
water_negative = waterneg,
water_neutral = waterneu
) %>%
mutate(sex = recode(
factor(gender),
"1" = "female", "2" = "male"
)) %>%
select(-gender) %>%
gather(
key = "condition",
value = "ratings",
beer_negative:water_neutral
) %>%
separate(condition, c("drink", "imagery")) %>%
mutate(
participant = factor(participant),
drink = factor(drink),
imagery = factor(imagery)
) %>%
write_csv(
., # the point indicates the current data frame
paste0(getwd(), "/data_attitude.csv")
)
head(att.filter.select.recode.long.sep, n = 5)It is often required to provide summary statistics of the data. How to do it with tidyverse?
summary.att.filter.select.recode.long.sep <-
att.filter.select.recode.long.sep %>%
group_by(drink, imagery) %>% # group according to conditions
summarize(
n = n(), # number of observations
mean = mean(ratings), # mean
sd = sd(ratings), # standard deviation
sem = sd / sqrt(n), # standard error of the mean
min = min(ratings), # range (min)
max = max(ratings), # range (max)
ci.95 = 1.96 * sem
) %>% # 95% confidence interval
print(.) # another way of displaying the results in console## # A tibble: 4 x 9
## # Groups: drink [2]
## drink imagery n mean sd sem min max ci.95
## <chr> <chr> <int> <dbl> <dbl> <dbl> <int> <int> <dbl>
## 1 beer negative 20 4.45 17.3 3.87 -19 30 7.58
## 2 beer neutral 20 10 10.3 2.30 -10 28 4.51
## 3 water negative 20 -9.2 6.80 1.52 -20 5 2.98
## 4 water neutral 20 2.35 6.84 1.53 -13 12 3.00
Do the following operations in one go:
MixedAttitude.dat)gender to a categorical variable with 2 levels:
beerpos, beerneg, beerneut, winepos, waterposrename, rename the variables you kept:
data_attitude_exercise.csvPlotting is one of the most satisfying things to do in R, especially if you use the package ggplot2… part of the tidyverse!
In a nutshell, ggplot2 allows you to build plots iteratively using a series of layers. You start with a dataset and specify its aesthetics (e.g., which variable should be represented on the x-axis?). Later, you can add layers with annotations, statistical summaries, and so on.
Let’s start by creating a basic bar plot. We will use the data frame with the summary statistics that we just created.
summary.att.filter.select.recode.long.sep %>%
ggplot(
., # data
aes(
x = drink, # 'drink' variable on x-axis
y = mean, # mean ratings on y-axis
fill = imagery
)
) + # separate colors for each level of 'imagery'
geom_bar(stat = "identity") # use the values in the data frame (no transformations)What the hell is that?!? Oh no, it’s a stacked bar graph! We don’t want that! How can we get separate bars for negative and neutral imagery?
summary.att.filter.select.recode.long.sep %>%
ggplot(
.,
aes(
x = drink,
y = mean,
fill = imagery
)
) +
geom_bar(
stat = "identity",
position = position_dodge()
) # bars are next to each otherBetter. Let’s add outlines.
summary.att.filter.select.recode.long.sep %>%
ggplot(
.,
aes(
x = drink,
y = mean,
fill = imagery
)
) +
geom_bar(
stat = "identity",
position = position_dodge(),
color = "black", # black outlines
size = 1
) # line thicknessSomething is missing… ah, error bars! Display 95% confidence intervals.
summary.att.filter.select.recode.long.sep %>%
ggplot(
.,
aes(
x = drink,
y = mean,
fill = imagery
)
) +
geom_bar(
stat = "identity",
position = position_dodge(),
color = "black",
size = 1
) +
geom_errorbar(aes(
ymin = mean - ci.95, # mean ratings ± 95% confidence interval
ymax = mean + ci.95
),
width = .2, # width of the error bars
position = position_dodge(.9)
) # position (centered on the bar)These colors are hideous… let’s use a more decent color palette.
The package viridis uses colors that are easier to distinguish for people with colorblindness.
library(viridis)
summary.att.filter.select.recode.long.sep %>%
ggplot(
.,
aes(
x = drink,
y = mean,
fill = imagery
)
) +
geom_bar(
stat = "identity",
position = position_dodge(),
color = "black",
size = 1
) +
geom_errorbar(aes(
ymin = mean - ci.95,
ymax = mean + ci.95
),
width = .2,
position = position_dodge(.9)
) +
scale_fill_viridis(
option = "viridis", # see ?scale_color_viridis for other color palettes
discrete = TRUE
) # map colors to discrete valuesLet’s add a final cosmetic touch.
summary.att.filter.select.recode.long.sep %>%
ggplot(
.,
aes(
x = drink,
y = mean,
fill = imagery
)
) +
geom_bar(
stat = "identity",
position = position_dodge(),
color = "black",
size = 1
) +
geom_errorbar(aes(
ymin = mean - ci.95,
ymax = mean + ci.95
),
width = .2,
position = position_dodge(.9)
) +
scale_fill_viridis(
option = "viridis",
discrete = TRUE
) +
scale_y_continuous("", # y-axis: no title
limits = c(-15, 15), # y-axis: min/max values
breaks = seq(-15, 15, 5)
) + # y-axis: tick marks
ggtitle("mean ratings") + # plot title
theme_classic(base_size = 18) + # text size
theme(
plot.title = element_text(
size = 24, # title: text size
hjust = .5
), # title: centered
legend.position = c(.9, .9)
) # legend position (upper right corner)No matter how pretty a bar graph is, it remains a suboptimal way of displaying your data.
A better solution is to use RDI plots, which show Raw data, Descriptive & Inferential statistics.
The following graph shows:
att.filter.select.recode.long.sep %>% # we need the complete data frame, not the summary
unite("condition", c(drink, imagery)) %>% # paste 'drink' and 'imagery' columns into 'condition'
# base plot
ggplot(
.,
aes(
x = condition,
y = ratings
)
) +
# box and whisker plot
geom_boxplot(
alpha = 1, # boxes: transparency
size = .5, # boxes: line thickness
outlier.alpha = 0
) + # outliers: transparency
stat_boxplot(
geom = "errorbar", # whiskers
size = .5, # whiskers: line thickness
width = .25
) + # whiskers: width
# violin plot
geom_violin(aes(fill = condition), # density: color fill
color = "transparent", # outline: color
alpha = .25
) + # density: transparency
# jittered data points
geom_jitter(
size = 3, # point: size
alpha = .3, # point: transparency
position = position_jitter(width = .1)
) + # point: jitter
scale_fill_viridis(
option = "viridis", # color palette for all fills
discrete = TRUE
) +
scale_color_viridis(
option = "viridis", # color palette for all outlines
discrete = TRUE
) +
scale_x_discrete(
limits = # x-axis: set variable order
c(
"water_neutral", "water_negative",
"beer_neutral", "beer_negative"
)
) +
scale_y_continuous(
name = "", # y-axis: title
limits = c(-25, 35), # y-axis: min/max values
breaks = seq(-25, 35, 5)
) + # y-axis: tick marks
geom_hline(
yintercept = seq(-25, 35, 5), # reference lines
linetype = "dotted", # reference lines: type
colour = "#999999", # reference lines: color
size = .8, # reference lines: thickness
alpha = .5
) + # reference lines: transparency
ggtitle("mean ratings") + # plot title
theme_classic(base_size = 18) + # custom theme (resize text)
theme(
legend.position = "none", # no legend
plot.title = element_text(size = 24, hjust = .5)
) # resize and center titleIf you don’t want to waste time doing it yourself, I recommend the yarrr package.
The pirateplot function creates a graph showing:
library(yarrr)
pirateplot(
formula = ratings ~ imagery + drink, # dependent ~ independent variables
data = att.filter.select.recode.long.sep, # data frame
main = "mean ratings", # plot title
ylim = c(-25, 35), # y-axis: axis limits
ylab = "", # y-axis: no label
inf.method = "ci", # type of inference: 95% confidence interval
inf.within = participant, # ID variable
# theme settings
pal = "espresso", # color palette: see piratepal("all")
point.o = .5, # data points: opacity (0-1)
point.cex = 1.3, # data points: size
bean.b.o = .6, # bean border: opacity (0-1)
bean.f.o = .6, # bean filling: opacity (0-1)
cap.beans = TRUE, # bean densities are capped at the data limits
bty = "n", # no box around the plot
gl.col = "gray", # background line color (major and minor lines)
gl.lwd = 1, # background line width
gl.lty = 2
) # background line type (dashed)The graphs above suggest that negative imagery may have influenced ratings to water more than to beer.
Compute the following operations in one go:
data_attitude_exercise.csv (use a function from the tidyverse instead of read.table)facet_grid)viridis color paletteCompute the following operations in one go:
data_attitude_exercise.csvyarrr package, create an RDI graph similar to the bar graph of Exercise 2.1 (i.e., separate graphs for female and male participants)pirateplot options), e.g., change color palette (hint: type piratepal("all"))The aim of this study was to assess whether negative imagery would influence the likeness ratings of alcoholic beverages. It’s time to statistically verify this hypothesis.
We will run a 2 (drink: water, beer) x 2 (imagery: neutral, negative) repeated measures ANOVA on these ratings.
For this demonstration I chose the package afex, authored by Henrik Singmann.
The code below shows how to run an ANOVA with this versatile and user-friendly package.
library(afex)
library(multcomp) # we didn't explicitly install this package, but it's part of the dependencies of 'afex'
rmANOVA.att <- aov_ez("participant", # variable with subject identifier
"ratings", # dependent variable
att.filter.select.recode.long.sep, # data frame
within = c("drink", "imagery"), # within-subject variables
type = 3
) # type-III sums of squares (default in SPSS)
rmANOVA.att## Anova Table (Type 3 tests)
##
## Response: ratings
## Effect df MSE F ges p.value
## 1 drink 1, 19 252.84 8.97 ** .19 .007
## 2 imagery 1, 19 82.92 17.63 *** .13 .0005
## 3 drink:imagery 1, 19 26.66 6.75 * .02 .02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
The results show:
The drink x imagery interaction is statistically significant… let’s run paired comparisons. afex uses functions included in the emmeans and multcomp packages.
## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
## Indicator predictors are now treated as 2-level factors by default.
## To revert to old behavior, use emm_options(cov.keep = character(0))
# set afex_options as below to use appropriate degrees of freedom (not Satterthwaite approximated)
# for details, see https://cran.r-project.org/web/packages/afex/vignettes/afex_anova_example.html#post-hoc-contrasts-and-plotting
afex_options(emmeans_model = "multivariate")
posthoc.att <-
emmeans(rmANOVA.att, ~ imagery:drink) %>% # estimated marginal means
pairs(., # compare differences between estimated marginal means
test = adjusted("free")
) %>% # "free": generalization of Bonferroni-Holm correction, taking into account correlations among model parameters
as.glht(.) %>% # better p-value adjustment for multiple testing
summary(.) %>% # cleaner output
print(.)##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## negative,beer - neutral,beer == 0 -5.550 2.604 -2.131 0.1425
## negative,beer - negative,water == 0 13.650 4.329 3.153 0.0187 *
## negative,beer - neutral,water == 0 2.100 4.997 0.420 0.9613
## neutral,beer - negative,water == 0 19.200 2.934 6.544 <0.001 ***
## neutral,beer - neutral,water == 0 7.650 3.034 2.521 0.0693 .
## negative,water - neutral,water == 0 -11.550 2.044 -5.652 <0.001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The output has all you need to report in a publication… well, almost. Something is missing.
It is good practice to report effect sizes along with p-values, so that readers can make their own mind with respect to the importance of the observed effects. Even better, you could report confidence intervals around effect sizes, so that readers can have a clear picture of the precision of your estimation.
I particularly like the idea of bootstrapping effect sizes, a better approach when the data are known not to be normally distributed or when the distribution is unknown4. I will show you how to do it using the package bootES5.
We will compute Hegdes’s g, an unbiased estimate of \(\delta\) (for details, see here)
Because our dependent variable consists of ratings collected from the same participants in different conditions (i.e., repeated measures), we must first manually compute the difference scores between our contrasts of interest (see the output of the paired comparisons above).
att.bootES <-
att.filter.select.recode.long.sep %>%
unite("condition", c(drink, imagery)) %>% # create 'condition' variable
spread(condition, ratings) %>% # convert from long to wide format
# compute mean differences
mutate(
beer_negativeVSbeer_neutral = beer_negative - beer_neutral,
beer_negativeVSwater_negative = beer_negative - water_negative,
beer_negativeVSwater_neutral = beer_negative - water_neutral,
beer_neutralVSwater_negative = beer_neutral - water_negative,
beer_neutralVSwater_neutral = beer_neutral - water_neutral,
water_negativeVSwater_neutral = water_negative - water_neutral
) %>%
# delete unused variables
# here we must specify that we want to use the 'select' function from 'dplyr' (part of the 'tidyverse'), because there is another function ('lm.ridge' in the 'MASS' package) that creates conflict
dplyr::select(-c(beer_negative:water_neutral)) %>%
# re-convert to long format
gather(
key = "diff.conds",
value = "ratings",
beer_negativeVSbeer_neutral:water_negativeVSwater_neutral
)
head(att.bootES, n = 5)Now we can calculate the standardized effect size for each difference scores. Using functions in the purrr package (part of the tidyverse!), we will create separate lists for each difference score and calculate bootstrapped Hegdes’s g for each of them.
library(bootES)
att.HedgesG <-
att.bootES %>%
split(.$diff.conds) %>% # split difference scores in separate lists
# apply bootES function to all lists
map(~ bootES(., # data
data.col = "ratings", # dependent variable
R = 5000, # number of samples
effect.type = "hedges.g", # type of effect size
ci.type = "bca", # bootstrap method
ci.conf = .95
)) # confidence levelThe result is a data frame containing 6 lists (i.e., the number of diff.conds), each containing the results of the bootstrapping procedure. As an example, let’s look at the list containing the bootstrapped Hegdes’s g for the difference ratings of beer after seeing negative vs. neutral imagery:
##
## 95.00% bca Confidence Interval, 5000 replicates
## Stat CI (Low) CI (High) bias SE
## -0.457 -0.939 0.054 -0.039 0.256
The final step is to summarize the information stored in different lists in one single data frame that can later be converted into a table and cleaned up for publication. This is one way to do it:
# initialize summary table: what do you want to report?
summary.att.bootES <- data.frame(magrittr::extract2(posthoc.att, 1)$object, # extract2, a function of 'magrittr' (part of the 'tidyverse'!), extracts values from lists (similarly to [[]] in base R)
"Hedges.g" = NA,
"CI.95.low" = NA,
"CI.95.high" = NA,
"bias" = NA,
"std.error" = NA
)
# OH MY GOD A LOOP IN R!
for (i in 1:length(unique(att.bootES$diff.conds))) { # loop through conditions
summary.att.bootES[i, 7:11] <- summary(magrittr::extract2(att.HedgesG, i)) # extract values from the list with bootstrapped Hedges's g
}
summary.att.bootESRun the same analyses using the exercise dataset (i.e., wine instead of beer):
Run a 2 (gender) x 2 (drink) x 2 (imagery) mixed ANOVA on likeness ratings:
gender is a between-subject factor!Hopefully, I managed to give you a glimpse of the versatility, readability, and user-friendliness of the tidyverse. Enjoy your tidy new life!
Thanks!
Hopefully you’re not familiar with pretending to collect data… if so, please tell your story by writing a book.↩
As you may imagine, today we won’t have time to cover all the amazing things you can do with these packages. Also, they are a gift that keeps on giving: I have been using them for a while and I keep discovering useful functions. If you want to learn more, read R for Data Science by Garrett Grolemund and Hadley Wickham.↩
This is a modified version of a dataset included in the book Discovering Statistics Using R (Field, Miles & Field, 2012).↩
If you prefer not to bootstrap your effect sizes, you can use the MBESS package (a useful tutorial can be found here). If you want to do it in SPSS, see here.↩
A clear explanation of how to use bootES is provided by the authors of the package in their paper.↩
schettino@eur.nl